Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent#354
Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent#354samikshya-db wants to merge 8 commits into
Conversation
1300f1a to
fbb0263
Compare
After v1.11.0 enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on /telemetry-ext. Three issues compounded: 1. Double-retry. The exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (internal/client.RetryableClient), which already retries 429/5xx with Retry-After. Result: up to RetryMax * (MaxRetries+1) HTTP attempts per export, all collapsed into one circuit-breaker outcome — so the breaker barely opened. 2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no User-Agent, so 429s were tagged Go-http-client/1.1 and could not be attributed to godatabrickssqlconnector by version. 3. High request volume. FlushInterval=5s, BatchSize=100. Changes: - telemetry/exporter.go: drop the retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx, Retry-After) are owned by the underlying retryablehttp client. Each export call → exactly one breaker outcome. - telemetry/exporter.go, telemetry/featureflag.go: set User-Agent header on telemetry POST and feature-flag GET. Built once at the connector site (buildUserAgent in connector.go), mirroring internal/client/client.go format (DriverName/DriverVersion + UserAgentEntry + agent product), and plumbed via TelemetryInitOptions.UserAgent. - telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200. Remove MaxRetries/RetryDelay from telemetry.Config and TelemetryInitOptions; telemetry_retry_count/_delay DSN params still parse for backwards compat but are no-ops. - telemetry/circuitbreaker.go: lower minimumNumberOfCalls 20 → 10 (so low-traffic clients can still trip the breaker on a sustained outage now that each export is one signal), and raise waitDurationInOpenState 30s → 60s (respect typical Retry-After). - Tests: removed obsolete retry/backoff tests; added single-attempt assertion across 4xx/429/5xx; added User-Agent assertions on both endpoints. Co-authored-by: Isaac
fbb0263 to
d47bf99
Compare
… first flush The previous commit removed the exporter's inner retry loop, but the retryablehttp layer (RetryMax=4) was still amplifying each telemetry export into up to 5 attempts on a 429 — defeating the breaker because only the final outcome was observed. - internal/client: add WithSkipRateLimitRetry context helper. RetryPolicy treats 429 as non-retryable when the flag is set; 5xx and transport errors are unaffected. Set on telemetry POSTs and feature-flag GETs so each rate-limited request is exactly one HTTP transaction = one breaker signal. - telemetry/circuitbreaker: record server-provided Retry-After deltas from 429 responses; the open-state wait becomes max(waitDurationInOpenState, hint). Hint cleared on the next half-open -> closed transition. - telemetry/aggregator: offset the first flush by a random [0, FlushInterval) so a fleet of clients booted together does not phase-align bursts. Co-authored-by: Isaac
- Doc accuracy on retry semantics: WithSkipRateLimitRetry only suppresses 429s. Generic 5xx and transport errors are governed by ClientMethod, so telemetry contexts (which set no ClientMethod) get one-shot semantics on those failure modes; 503 still retries via the method-agnostic path. Updated SkipRateLimitRetry godoc and doExport comment to match. - Drop dead TelemetryRetryCount / TelemetryRetryDelay fields from UserConfig. The DSN keys are still consumed (so they do not leak into session params) but no longer stored. Cleaned up assertions in config_test.go and connector_test.go. - Clear retryAfterHint when the breaker transitions Open -> Half-Open, so a stale hint cannot extend a later reopen that did not carry its own Retry-After. Collapsed the two-step lock dance in execute() into one critical section. Updated godoc. - Guard rand.Int63n against a zero flushInterval in flushLoop. - Document the first-caller-wins User-Agent behaviour on the per-host telemetry client singleton, since later connections on the same host reuse the cached client. Co-authored-by: Isaac
- gofmt -s on connector_test.go, internal/config/config_test.go, telemetry/exporter.go (struct alignment + missing blank line). - internal/config/config.go: telemetry_retry_count is parsed for backwards compat then discarded, so call extract (2 returns) rather than extractAsInt (3 returns) to satisfy dogsled. - telemetry/circuitbreaker.go: drop the now-unused setState helper. execute() now transitions to half-open via setStateUnlocked under an already-held lock. Co-authored-by: Isaac
|
Must-fix
Should-fix
Test coverage
|
|
Thanks for the thorough review. All ten items addressed in a follow-up commit on this branch: Must-fix
Should-fix Test coverage All existing tests still pass; This comment was generated with GitHub MCP. |
- Broaden WithSkipRateLimitRetry → WithSkipTransientRetries to cover 429 AND 503; telemetry now collapses to one HTTP transaction per export across all transient codes, eliminating the 5xx asymmetry and any double-honour of Retry-After. - Fix latent bug: RetryPolicy returned (false, err) on the skip path, which made net/http's wrapper drop the response. Return (false, nil) so the exporter can see and act on Retry-After. - Parse Retry-After on 503 in addition to 429. - Log a one-time WARN when deprecated telemetry_retry_count / telemetry_retry_delay DSN params are present. - Document the breaker's detection-time / window-size / minimum-calls rationale in defaultCircuitBreakerConfig. - Extract BuildUserAgent into internal/client/useragent.go as the single source of truth for the Thrift + telemetry + feature-flag user-agent string. - Tests: TestIsRetryableServerResponse, TestExport_503RecordsRetryAfter, TestExport_SingleAttemptThroughRetryableClient (integration test using the real RetryableClient), TestCircuitBreaker_RetryAfterHintTakesMax, TestCircuitBreaker_HintClearedBeforeHalfOpenFailRecycle. Co-authored-by: Isaac
- RetryPolicy: only read Retry-After when we're going to use it (skip-transient-retries branch returns early). - Add BuildUserAgent table test covering plain / UserAgentEntry / detected-agent / combined paths; test clears known agent env vars first so the host environment can't leak in. - Gate each legacy-DSN-param deprecation warning behind a sync.Once so apps that reparse the same DSN per connection don't flood the log with the same notice. Co-authored-by: Isaac
Summary
After
v1.11.0enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on/telemetry-ext. Three issues compounded:internal/client.RetryableClient), which already retries 429/5xx withRetry-After. Result: up toRetryMax × (MaxRetries+1)HTTP attempts per export, all collapsed into a single circuit-breaker outcome — so the breaker barely opened against persistent throttling.User-Agent, so 429s landed in access logs tagged asGo-http-client/1.1and could not be attributed togodatabrickssqlconnectorby driver version.FlushInterval=5s/BatchSize=100.Changes
Retry behavior
telemetry/exporter.go— RemoveddoExport's retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx,Retry-After) are owned by the underlying retryablehttp client. Eachexport()call now corresponds to exactly one HTTP transaction = one breaker outcome.telemetry/config.go,telemetry/driver_integration.go— RemovedMaxRetries/RetryDelayfromtelemetry.ConfigandTelemetryInitOptions.telemetry_retry_count/telemetry_retry_delayDSN params still parse without error for backwards compatibility but are no-ops.Identifiability
connector.go— NewbuildUserAgenthelper mirroringinternal/client/client.go:295-302exactly:DriverName/DriverVersion+ optionalUserAgentEntry+ agent product.telemetry/exporter.go,telemetry/featureflag.go— SetUser-Agenton telemetry POST and feature-flag GET. Plumbed viaTelemetryInitOptions.UserAgent.Cadence and breaker tuning
telemetry/config.go—FlushInterval5s → 30s,BatchSize100 → 200.telemetry/circuitbreaker.go—minimumNumberOfCalls20 → 10 (so low-traffic clients can trip the breaker now that each export is one signal),waitDurationInOpenState30s → 60s (respect typicalRetry-After).Tests
TestExport_RetryOn5xx,TestExport_ExponentialBackoff,TestIsRetryableStatus, retry-config parsing tests).TestExport_SingleAttemptPerExportcovering 4xx/429/5xx, asserting the exporter never retries.TestExport_SetsUserAgentandTestFetchFeatureFlag_SetsUserAgent.Mitigation
While this rolls out, users can opt out via DSN:
enableTelemetry=false. Server-side: disableenableTelemetryForGoDriverfor affected workspaces.Test plan
go test ./...— all green locally./telemetry-extand/api/2.0/connector-service/feature-flags/GOLANG/...requests carrygodatabrickssqlconnector/<version>inhttp_user_agent./telemetry-extdrops after rollout.This pull request and its description were written by Isaac.